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ABSTRACT 

Despite having various attractive qualities such as high prediction 
accuracy and the ability to quantify uncertainty and avoid over¬ 
fitting. Bayesian Matrix Factorization has not been widely adopted 
because of the prohibitive cost of inference. In this paper, we 
propose a scalable distributed Bayesian matrix factorization algo¬ 
rithm using stochastic gradient MCMC. Our algorithm, based on 
Distributed Stochastic Gradient Langevin Dynamics, can not only 
match the prediction accuracy of standard MCMC methods like 
Gibbs sampling, but at the same time is as fast and simple as stochas¬ 
tic gradient descent. In our experiments, we show that our algo¬ 
rithm can achieve the same level of prediction accuracy as Gibbs 
sampling an order of magnitude faster. We also show that our 
method reduces the prediction error as fast as distributed stochastic 
gradient descent, achieving a 4.1% improvement in RMSE for the 
Netflix dataset and an 1.8% for the Yahoo music dataset. 

1. INTRODUCTION 

Recommender systems have become a pervasive tool in industry 
to understand customers and their interests in products. Examples 
range between music recommendation (Pandora), book recommen¬ 
dation (Amazon), movie recommendation (Netflix), news recom¬ 
mendation (Yahoo) to partner recommendation (eHarmony). Rec¬ 
ommender systems represent a personalized technology that can 
help filter at an individual level the enormous amounts of informa¬ 
tion that is available to us. Given the exponential growth of data, 
recommender systems are likely to play an increasingly important 
role to manage our information streams. 

During 2006-2011 Netflix EDU ran a competition where teams 
around the world could develop and test new recommender tech¬ 
nology on Netflix movie rating data. A few valuable lessons were 
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learnt from that exercise. First, matrix factorization methods work 
very well compared to nearest neighbor type models. Second, av¬ 
eraging over many different models pays off in terms of prediction 
accuracy. One particularly effective model was Bayesian proba¬ 
bilistic matrix factorization (BPMF) 1281 where predictions are av¬ 
eraged over samples from the posterior distribution. Besides im¬ 
proved prediction accuracy, a full Bayesian analysis also comes 
with additional advantages such as probabilities over models, con¬ 
fidence intervals, robustness against overfitting, and incorporating 
prior knowledge and side-information (S125). 

Unfortunately, since the number of user-product interactions can 
easily run into the billions, posterior inference is usually too ex¬ 
pensive to be practical. Learning at that scale requires data and 
computation to be distributed over many machines and learning 
updates to only depend on small minibatches of the data. Effec¬ 
tive distributed learning algorithms have been devised for alternat¬ 
ing least squares (ALS) and stochastic gradient descent (SGD) m 
l26ll3ni29ll23lfT6ll20lf79ll32l. In particular, Distributed Stochastic 
Gradient Descent (DSGD) fl~4l has achieved a significant speed¬ 
up by assigning partitioned rating matrix blocks to workers and 
then by updating some “orthogonal” blocks in parallel using “strat¬ 
ified” SGD. DSGD outperformed other parallel SGD approaches 
such as PSGD QfDHSl and ISGD fl9ll3?l where SGD is applied 
also on some subsets of the ratings while synchronizing globally 
after each sub-epoch (PSGD) or once at the end of the training 
(ISGD). Unfortunately, so far it has proven difficult to apply these 
advances in distributed learning to posterior sampling in Bayesian 
matrix factorization models. For instance, for BPMF which re¬ 
quires 0((L + M)D' i ) computation per iteration (with L and M 
are number of users and items, and D is latent feature dimension), 
distributed computation has not nearly been as effective. 

In this paper, we propose a scalable and distributed Bayesian ma¬ 
trix factorization method which combines the predictive accuracy 
of Bayesian inference and the learning efficiency of stochastic gra¬ 
dient updates. To this end, we extend a recently developed MCMC 
method, called Stochastic Gradient Langevin Dynamics (SGLD) 
1301. so that the updates become efficient in the setting of dis¬ 
tributed, large-scale matrix factorization. We adapt the SGLD up¬ 
dates to make them suitable for distributed learning on subsets of 
users and products (or blocks). Each worker manages only a small 
block of the rating matrix, and updates and communicates only a 
small subset of the parameters in a fully-asynchronous or weakly- 
synchronous fashion. Unlike distributed SGD where a single model 
is learnt, our method deploys multiple parallel chains over workers. 
Consequently, samples are collected at a much faster rate than ordi- 



nary MCMC and the multiple parallel chains can explore different 
modes of parameter space. Both features are reducing the variance 
and increasing the accuracies of our predictions. 

In the experiments on the Netflix and Yahoo music datasets (the 
latter being one of the largest publicly available dataset for rec¬ 
ommendation problems), we show that our method achieves the 
same level of accuracy as BPMF but an order of magnitude faster. 
Reversely, at almost the same efficiency as distributed SGD, our 
method achieves much better accuracy (4.1% RMSE improvement 
for the Netflix dataset and 1.8% for Yahoo music dataset). As such 
we believe that the method proposed in this paper is currently the 
most competitive matrix factorization method for industry scale 
problems. 

2. PRELIMINARIES 

2.1 Bayesian Matrix Factorization 

Suppose we have L users and M items. Our goal is to learn 
latent feature vectors Ui,Vj £ R D such that the rating Rij for 
item j by user i can be predicted as Rij « U{ Vj. We denote the 
entire rating matrix by R £ R ixM , and the latent feature matrices 
by U £ R DxL and V £ R DxM , so that R « U T V. Assuming 
a Gaussian error model, the likelihood of the parameters U and V 
can be written as: 

l m T 

P (Riu,v,r) = nnK^i^^i(i) 

*=ii=i L 

where is equal to 1 if user i rated item j and 0 otherwise. 
Throughout the paper, we fixed r = 1 for simplicity Although, 
in theory, U and V can be learned by maximizing the likelihood 
above, this results in severe over-fitting because only a few ratings 
are known (i.e. R is very sparse). 

Therefore, a Bayesian Probabilistic Matrix Factorization (BPMF) 
model was proposed to overcome this problem 1281 . In addition 
to controlling over-fitting through posterior averaging, BPMF also 
provides estimates of uncertainty through the posterior predictive 
distribution. The BPMF model as proposed in HD is as follows. 
We place priors on U and V as: 


L 


p(\J\pu,Au) 

= n^i^.A-), 

2 = 1 

(2) 

P(V \pv,A v ) 

M 

= rKra/w-A-). 

(3) 


3=1 


We further place Gaussian-Wishart hyper-priors on the user and 
item hyperparameters Qu = {pu, Ac/} and ©v = {pv, Ay}: 

p(0c/|0o) = Af(pu\po, (/SoAy) 1 )W’(Ac/|W / o, t'o), (4) 
p(0v"|0o) = Af{pv\po, (/3oAv) 1 )W’(Av|H / o, vo), (5) 

where vq is the number of degrees of freedom and Wo is a D x D 
scale matrix. We collectively denote the parameters of the hyper¬ 
prior by 0 O = {no, Po, vo, Wo}. 

At test time, the predictive distribution of an unknown rating 
R*j can be obtained by marginalizing over both model parameters 
U, V and hyper-parameters 0c/, Qv, 

p(RSj |R,0o) = J J p(R*ij\Ui,Vj)p(V ,V\R, ®u,®v) 

p(Qu, 0v|0o)d{U, V}d{0c/, 0v} (6) 

'All update equations are derived with r = 1. 


Algorithm 1 Gibbs Sampling for BPMF 

1: Initialize model parameters U (1) , V a) 

2: for t = 1 : T do 

3: // Sample hyperparameters 

0® ~p(0c/|U» 0o), 0® ~ p(0y |V (t) , 0 O ) 

4: for i = 1 : L in parallel do 

5: C/f +1) ~ p(Ui\Tl, V (t) , 0®) // sample user features 

6: end for 

7: for j = 1 : M in parallel do 

8: Vj t + 1) ~ p(Vj|R, U (t) , Qy) // sample item features 

9: end for 

10: end for 


We can estimate this using a Monte Carlo approximation: 

1 T 

p(RZj |R,e 0 ) » (7) 

t=1 

where {Uf, Vf } is the f-th sample from the posterior distribution: 

p(U,V,0c/,0v|R, 0o). (8) 

These samples can be generated using Gibbs sampling (Algorithm 
[TJ, since by conjugacy the conditional distributions of Ui and Vj are 
Gaussian, and those of 0c/ and 0 v are Gaussian-Wishart. How¬ 
ever, sampling from the conditional distribution of Ui or Vj in¬ 
volves 0(D 3 ) computations (for inverting a D x D precision ma¬ 
trix) and since this has to be done for each user and item, results in 
a total of 0((L + M)D 3 ) computations per iteration. Thus, BPMF 
using Gibbs sampling cannot scale up to real world recommender 
systems with millions of users and / or items. 

Although it is possible to parallelize BPMF using MapReduce 
style synchronous global updates, the cubic order complexity still 
limits its applicability to small D. Also, we require a large number 
of workers to effectively distribute the L + M cubic-order compu¬ 
tations. Furthermore, since running the Gibbs sampler from scratch 
is too expensive, a separate SGD optimizer is usually deployed to 
reach near the Maximum-a-Posteriori (MAP) state before starting 
the Gibbs sampler. However, running two different large-scale dis¬ 
tributed algorithms, each of which requires different optimal set¬ 
tings for the distribution of data and parameters, as well as cluster 
architectures, adds another considerable level of complexity. 

2.2 Stochastic Gradient Langevin Dynamics 

Assume we have a dataset of N i.i.d. data points, denoted by 
X = {x n }n = i , which we model using a distribution p(x\9) pa¬ 
rameterized by 9 £ R d . We choose a prior distribution p(9) and 
our goal is to sample from the posterior distribution p(9\X) oc 
p(X\9)p(9) using MCMC. 

One way of obtaining efficient MCMC proposals is to use the 
gradient of the target density 

GHEDimin), eg ' Langevin Dy¬ 
namics 03 is an MCMC algorithm which proposes candidate states 
according to: 

9 t+ i «- 9 t + ^ | Ve f log p(9 t ) + ^ g(9; x) i + v t 
l x£X ) 

where u t ~ AfO, e t I) (9) 

In the above, et is the step size and g(9;x) = Vt> logp(a;|6 l ) is 
the score. A Metropolis-Hastings (MH) test is then used to decide 
whether to accept or reject the proposal. The gradient information 
allows the Langevin algorithm to make proposals to high density 








regions and therefore have a high probability of acceptance. How¬ 
ever. in large-scale problems where N = \X | can be very large, the 
O(N) computations per update, required for computing the gradi¬ 
ent as well as for the MH test, is infeasible. 

Stochastic Gradient Langevin Dynamics (SGLD) Go) is the first 
in a line of recently developed approximate MCMC algorithms [3] 
mmmm that try to address this issue using noisy gradients that 
can be cheaply computed from a mini-batch of n <C TV data points. 
SGLD uses the following update rule: 

dt+i <- 9 t + ^ {V 0t log p(9 t ) + Ng(9 t ;Mt)} + vt- (10) 

Her eg(9 t ;Mt) = g(x;9t), the mean score computed 

from a mini-batch Aft. SGLD converges to the true posterior dis¬ 
tribution if the step size is annealed to zero at a rate that satisfies 
the following conditions: 

OO OO 

^et = °°, ^e?<oo. (11) 

t=i t=i 

SGLD does not use accept-reject tests because the acceptance rate 
tends to one as the step size goes to zero. Therefore, unlike tra¬ 
ditional MCMC algorithms which require O(N) computations per 
iteration, SGLD requires only 0(n ) computations. 

More generally, it is valid to replace g(9t;A4t) in eqn. 1 10| with 
any estimator f(9, Z-X) that satisfies the following conditions: (i) 
it is an unbiased estimator of the true gradient i.e. E z[f(9, Z\X)] = 
g(9 ; X) (ii) it has finite variance Vz[/(0, Z\ X)] < oo. Here, the 
expectation and variance are w.r.t. the distribution p(Z; X) of the 
auxiliary random variable Z. 

Distributed SGLD (DSGLD) (6) further extends the power of 
stochastic gradient MCMC using distributed computing. In DS¬ 
GLD, the dataset is first partitioned and distributed to S workers. 
Then, multiple chains collect samples in parallel by sampling for 
the length of a round (called a trajectory) at a worker. After a round, 
each chain switches to a different worker. In 06), it is shown that us¬ 
ing the following valid SGLD update rule, we can collect samples 
from the posterior using the distributed datasets: 

9t +1 «- Ot + | { V flt log p(9 t ) + ^g(9 t ; M s> ) J + vt. (12) 

Here, s is the index of the worker where a chain resides at iteration 
t, W <s) is the size of the local dataset at worker s, and v (s> is the 
normalized visiting rate to worker s such that t/ s) = 1 and 
i/ a) £ (0,1). The mini-batch Af ( t s) is sampled only from the local 
dataset of worker s. 

3. BAYESIAN MATRIX FACTORIZATION 
USING SGLD 

3.1 Model 

We will now show how DSGLD can be used for BPMF. Instead 
of the model described in Section HD we will use a slightly sim¬ 
plified model mini. We use the same likelihood as in eqn. [T] but 
choose simpler priors: 
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p(U|Ay) 

= nwi°.v), 

(13) 

p(V|Ay) 

M 

= n^i°,v). 

3=1 

(14) 


Here, Ay and Ay are D-dimension diagonal matrices whose d-th 
diagonal elements are A u d and Xy d respectively. We also choose 
the following hyper-priors: 

Su d Av d ~ Gamma(a 0 ,/3o)- (15) 

We choose this simplified model because the proposed method 
benefits mainly from performing a large number of inexpensive up¬ 
dates (i.e. collecting many samples) per unit time rather than very 
expensive but high quality updates. The above model is well suited 
for this because each latent vector can be updated in linear ( O(D )) 
time. At the same time, we still benefit from the power of Bayesian 
inference through marginalization of the important regularization 
parameters A = {Ay, Ay} as well as U and V. 

Although it is possible to apply our method to the model in 
Section [2~I| updating the full covariance matrix is more expensive 
(' 0(D 2 ) time per update) and therefore requires more time to con¬ 
verge without significant gain in accuracy (as per our pilot experi¬ 
ments). 

3.2 Inference 

In the following section, we first present our algorithm in a single 
machine setting and later extend it for distributed inference. We 
alternate between sampling from p(U, VjR, A) using SGLD and 
sampling from p(A|R, U, V) using Gibbs. 

3.2.1 Sampling U, V I A, R using SGLD 

Since, usually only N <C M X L ratings are observed, the 
rating matrix R is stored using a sparse representation as X = 
{x n = (pn,qn,r n )}n =1 . where each x n is a (user, item, rating) 
tuple and N is the number of observed ratings. The gradient of the 
log-posterior w.r.t]^ Ui is: 

1 v 

G(X) = J2g^(Ui-,X)-AuUi (16) 

71 = 1 

where 

9n(Ui-, X) = I [ Pn = i\X](r n - Uj n V qn )V qn (17) 

Here I[p n = i\X] is an indicator function that equals 1 if the n-th 
tuple in X pertains to user i and 0 otherwise. To use SGLD, we 
need an unbiased estimate of this gradient that can be computed 
cheaply from a mini-batch. 

One way to obtain this is by subsampling a mini-batch M = 
{(Pn, q-n, r n )}}T=i °f tn tuples from X and computing the follow¬ 
ing stochastic approximation of the gradient: 

G 1 (M) = Ng(U i ;M)-AuU i (18) 

where, g{Ui\M) = ^ 5Z™=i 9n(Uu M). Note that the mini¬ 
batch is subsampled from the complete dataset X and not just from 
the tuples associated with user i. The expectation of G i over all 
possible mini-batches is: 

E.m[GT(M)] = E M [Ng{Ui-,M)}-AuUi 

N 

= Y,^(Ui-,X)-AuUi 

71 = 1 

= G(X). 

Since Gi is an unbiased estimator of the true gradient, we can use 
it for computing SGLD updates. However, note that Gi is non¬ 
zero even for users that are not in the mini-batch M, because of 

2 We derive only w.r.t. Ui. Update rules for other parameters can 
be obtained by the same procedure. 




Figure 1: Block split schemes. 


the prior gradient term —Aj/C/j. Therefore, we have to update the 
parameters for all users in every iteration, which is very expensive. 

If we were to update only the parameters of users who have rat¬ 
ings in the mini-batch AT the estimator can be written as: 

G 2 (M) = Ng(Ui-,M)-I[i£M P ]AuUi (19) 

where I[i £ A4 P ] is equal to 1 if M contains a tuple associated with 
user i and 0 otherwise. However, G2 is not an unbiased estimator 
of the true gradient: 

N 

Ex[G 2 (A 1 )] = g„(Ui-, X) — hitAuUi. (20) 

71 = 1 


where hi* = E m [I[i £ Af p ]], i.e. the fraction of mini-batches that 
contains at least one tuple associated with user i (among all possible 
mini-batches). If the mini-batches are sampled with replacement, 
we can compute this as: 

hi* = 



where Ni* = 1 = i\X], the number of ratings by user i 

in the complete dataset X. Thus, we can remove the bias in G 2 by 
multiplying the gradient of the prior term with h~* as follows: 


G 3 (M) = Ng(Ui\M) - I[i £ Mpjh-'AuUi. (22) 


G 3 is an unbiased estimator of the true gradient G and is non-zero 
only for users that have at least one rating in AT Thus we need to 
update only a subset of user features in each iteration. The SGLD 
update rule (for users with ratings in Alt) is: 

Ui, t +i «- Ui, t + | ySTg{UiX,Mt) - + u t (23) 


3.2.2 Sampling A|U, V 

We can easily sample from the conditional p(A|U, V), because 


by conjugacy: 




- Gamma (a 0 + j, /3 0 + ^ ^ U# j , 

_ ( M . 1 ^ I/2 \ 

- Gamma U 0 + y , A) + ^ z2 V dj j ■ 

(24) 

AvJU.V- 

(25) 


If this is computationally demanding, we can also consider updat¬ 
ing A using SGLD or mini-batch Metropolis-Hastings fl7l !7i. 


3.3 Distributed Inference 

For distributed inference, we partition the rating matrix R into a 
number of blocks. Fig. [IJshows a few different ways of partitioning 
R. Two blocks are said to be orthogonal to each other if the users 
and items in one block do not appear in the other block. A set of 
two or more mutually orthogonal blocks is called an orthogonal 
block group (or simply, orthogonal group). For example, the two 


gray-colored blocks (1 and 4) in Fig. |T](a) are orthogonal to each 
other and thus form an orthogonal group. In Fig. |T](b), the blocks 
are not orthogonal because all columns are shared. In this case, we 
say that each block by itself is an orthogonal group. 

The blocks are then distributed to workers in such a way that all 
blocks are assigned and a worker has at least one block. In the fol¬ 
lowing, we assume for simplicity that each worker is a single-core 
machine. However, it is easy to generalize our algorithm to take 
advantage of multi-core (or threads) workers with shared memory 
support. 

We will now describe our distributed algorithm for BPMF. First, 
imagine that there is only one Markov chain c (but the dataset is 
distributed across multiple workers). A central parameter server 
holds the global parameters U c and V c of chain c. Since A de¬ 
pends only on U c and V c , it is easy to update A at the parameter 
server using Gibbs as per Eqns. [25]and[24] Thus, we will focus on 
the DSGLD part of the chain that samples from p(U, V|R, A). 

Each sampling round consists of the following steps: (1) The 
parameter server picks a block s via a block-scheduler and sends 
the corresponding sub-parameter U (c,s) and V lc, ' ) to the block’s 
worker. (2) The worker updates the sub-parameter by running DS¬ 
GLD (see section [3~.3.1| for update equations) for a number of itera¬ 
tions using its local block of ratings. (3) The worker sends the final 
sub-parameter state back to the parameter server. (4) The parameter 
server updates its global copy to the new sub-parameter state. 

Thus, the Markov chain jumps among the distributed blocks through 
the corresponding workers and updates the sub-parameters associ¬ 
ated with the block chosen in each round. Since each iteration of 
local DSGLD updates requires only a mini-batch of data, sampling 
is very fast. Also, communication overhead is low because a) the 
multiple local updates (iterations) performed within a round do not 
require any communication b) only a small sub-parameter associ¬ 
ated with a specific block is transferred in each round. There are 
two levels of parallelization that we use to further speed up sam¬ 
pling. 

1. Parallel updates within a chain: . A chain can update sub¬ 
parameters U (c,5l) and U <c ’ S;i) in parallel if the blocks si and S 2 
are orthogonal to each other. For example, in Fig. [T] (a), updating 
block 1 and then block 4 produces the same result as updating both 
in parallel. This makes the algorithm progress faster in terms of 
number of updated parameters per round. The actual performance 
improvement is dependent on the size of the orthogonal group. For 
instance, with a 4 x 4 split, the algorithm will update the parameters 
faster than with a 2 x 2 split because more parameter blocks can 
be updated in parallel. However, updates in smaller blocks can be 
noisier, because the gradients computed from smaller blocks will 
have higher variance. Therefore, at some point the loss in perfor¬ 
mance caused by noisier updates on small blocks can exceed the 
gain obtained by faster updating of the parameters. 

2. Multiple parallel chains: We can run as many chains in par¬ 
allel as we like, subject to only computational resource constraints. 
Each chain can update its parameters in parallel independent of 
other chains. Hence, the chains are asynchronous in the sense that 
the status of a chain does not block other chains unless the chains 
conflict for computation resources. For the split in Fig. |T](a), one 
chain can update using the gray block group while another chain 
is using the white block group. Or both chains can use the same 
block if we assume a shared memory multi-threaded implementa¬ 
tion. By running multiple chains in parallel, we effectively multiply 
the number of collected samples by the number of parallel chains. 
Since the variance of an MCMC estimator is inversely proportional 
to the number of samples, fast sample generation will compensate 
for the low mixing rate of SGLD. Also, by initializing the different 
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Figure 2: An example illustration. On the left, a matrix R is 
partitioned into 2x2 blocks, Bn, • • ■ , B 22 . There are two 
orthogonal groups (the gray (Bn,B 22 ) group and the white 
(Bi 2 ,B 2 i) group). We run two independent chains, chain a 
with parameters U“ and V" (solid-line rectangles) and chain 
b, with parameters U 6 and V 6 (dotted-line rectangles). Given 
four workers, we assign a block to each worker. At round t — 1, 
chain a updates using the gray orthogonal group and chain b 
using the white orthogonal group. Note that the entire U and 
V matrices of both chains are updated in this single round. In 
the next round, the chains are assigned to the next orthogonal 
groups by the block-scheduler. 


Algorithm 2 DSGLD at parameter server 

1: Initialize model parameters of each chain {UJ, VJ, 
step sizes {tt} 

2 : for each chain c parallel do 
3: for f = 1: max_iter do 

4: Be <r- GET_ORTHO_BLOCK_GROUP(c, t) 

5: for worker .8 7 B, do 

6 : U'tXl, V<° 4 _ s > <— WKR_ROUND(U' c '”\ V‘ c ' s) , A' t c \ tt) 

7: end for 

8 : if not bum-in then 

9: Store as a sample of chain c 

10: Sample A^U® 1; using Eqn. (24} and (25} 

11: end if 

12: end for 

13: end for 


chains in different places of parameter space, we can explore mul¬ 
tiple local minima. This is especially important for large-scale high 
dimensional problems where the time budget is usually not enough 
for a single chain to mix between different local minima. 

An illustration of these ideas is given in Fig. [2] Algorithms [2] 
and [3] describe the operations at the parameter server and workers 
respectively. 

A proper block splitting scheme can be chosen according to the 
characteristics of the problem and available resources. In other 
words, we can trade-off within-chain parallelization and between- 
chain parallelization. For example, given S workers, by using a 
squared split as in Fig. [T] (a), we can run \/S chains in parallel 
where each chain updates \^S blocks in parallel. This way we max¬ 
imize the within-chain parallelism. On the other hand, by reducing 
the size of orthogonal groups, we can decrease the within-chain 
parallelism in order to increase the between-chain parallelization, 
i.e. number of parallel chains. At an extreme of this approach, 
we can let each block become an orthogonal group by itself as in 
Fig. □ (b) and run S independent chains in parallel. Note that 
in this case, we can choose not only the column splitting but any 
splitting scheme. Our experiment results suggest to maximize the 
within-chain parallelism as the dataset size increases. For smaller 
datasets, we may benefit more from the generalization performance 
of a large number of parallel chains than from a smaller number of 
chains using the block orthogonality. 


Algorithm 3 DSGLD at worker s 

1: Initialize hi*, h,j, round length 7 , mini-batch size m 
2 : function wkr_round(U IC ’ 5) , V lc ’ s) , A <c) , t t ) 

3: for t = 1 : 7 do 

4: Sample a mini-batch Ait from X w 

5: for each user i and item j in A it parallel do 

6 : Update Ui, V-j using Eqn. (29} and (30} 

7: end for 

8: end for 

9: Send updated U lc,sl and V (c,s) to the parameter server 

10: end function 


3.3.1 Distributed SGLD 

Since X (the sparse representation of R) is partitioned into S 
blocks X m , ..., A' |S) , each worker uses only one of the X ,sl for 
computing updates. Thus, we need to modify the bias correctors 
in Eqn. ED so that the gradient estimator remains unbiased under 
this constraint. If we assume uf = 1 A’ <s) = X and nf = 1 A’ (s) = 0, 
and that worker s is visited with normalized frequency v (3 \ the cor¬ 
rection factors for users and items can be shown to be, respectively: 

s s 

hi, = h,j = (26) 

S=1 S=1 

where: 

' ,s = 1 -( 1 -sS)"’ '*' 5 ~ 1 -( 1 -s§) (27) 

here JV <s) = | X (s> \, the total number of ratings in s, and 

jy(. s ) jy( s ) 

N% = = J2 %" = ( 28 > 

n=1 n=1 


i.e. the number of ratings by user i and of item j respectively in s. 
Therefore, the local DSGLD update rule using block X ls) is: 


tw I- Ui,t + | { - 

ft ( A < “ ) 

Vj,t+i <- v it t + j | Mf) - 


A uUi,t \ 
hi* ) 

A yV jtt \ 

h,j J 


+ vt (29) 
+ u t . (30) 


The above rule updates only the sub-parameter associated with block 
s using only rating tuples in s. 


4. EXPERIMENTS 

4.1 Algorithms and Models 



Optimization 

MCMC 

Single Machine 

SGD 

SGLD. Gibbs 

Distributed 

DSGD 

DSGLD 


Table 1: Algorithms. 

We compared five algorithms: SGD, DSGD, SGLD, DSGLD, 
and Gibbs sampling. As shown in Table [T] each algorithm can 
be classified based on whether it is running on a single machine 
or a distributed architecture, and also based on whether it is an 
optimization or MCMC algorithm. Since Gibbs sampling was very 
slow, we update user/item features in parallel (as suggested in f28l ) 
using multiple cores of a single machine. Thus, by Gibbs sampling 




























































we will mean the parallelized (but not distributed) version from 
now on. 

For DSGLD, we tested two block-splitting schemes. Given S 
workers, DSGLD-S (‘S’ stands for square) partitions R into y/S x 
y/~S blocks as in Fig. [I] (a), i.e. DSGLD-S tries to maximize the 
within-chain parallelism by using as many orthogonal blocks as 
possible. We run VS parallel chains, where each chain updates VS 
sub-parameter blocks in parallel using y/S workers. Therefore, all 
chains can update all parameter at every round. The second split¬ 
ting scheme, called DSGLD-C (‘C’ stands for column blocks) di¬ 
vides R into S blocks as shown in Fig. |T|b). We split R along the 
rows because in our experiments we have many more users than 
items. The blocks in DSGLD-C are not orthogonal because all 
columns are shared, so we just run S independent parallel chains. 

For Gibbs sampling, we use the original BPMF mode0described 
in section 0 For the other algorithms, we slightly extend the 
model described in section [3~T| (as in I10II18I ). The extension in¬ 
cludes user and item specific bias terms a,i and bj respectively so 
that the predictions are modeled as: 

Rij ~ Ui Vj + a; + bj (31) 

We use the following priors and hyper-priors for a,i and bj: 

ai-A^O.A- 1 ), bj ~ JV(0, A^ 1 ), 

Aa, A b ~ Ganinta(ao,ft)- 


For U and V, we use the same priors and hyper-priors as described 
in Section 0 Note that, in the new model, we have to sample 
di, bj, A a , Ab in addition to U, V, Ajj, Ay. The DSGLD update 
rules for a t and bj are: 


Cbi,t +1 di,t + 

bj,t+1 bj,t + 


£t [ (V 1 ' 1 A /#( s )\ AaOi t \ 


2 [ v (s > 
e t f N M 
2 \ t)W 


+ v t (32) 


g(bj,t;M t s> ) - 


A bbj,t 


+ u t . (33) 


The main goal of our experiments is to answer the following ques¬ 
tions: 


• Accuracy. How does DSGLD compare to other methods in 
terms of prediction RMSE? 

• Speed.'. How fast can DSGLD achieve the RMSE obtained by 
1) optimization algorithms (SGD, DSGLD) 2) Gibbs sam¬ 
pling? 

• Factors which affect the above: The number of workers, num¬ 
ber of chains, block splitting schemes and the latent factor 
dimension. 

4.2 Setup 


Dataset 

# users 

# items 

# ratings 

Netflix 

480K 

18K 

100M 

Yahoo 

1.8M 

136K 

700M 


Table 2: Datasets. 

We compare all 5 algorithms on two large datasets, Netflix movie 
ratings (§] and Yahoo music ratings (3j (details in Table [2j. To the 
best of our knowledge, the Yahoo dataset was one of the largest 
publicly available datasets when we performed the experiments. 
Note that the Yahoo dataset we use here is different from the one 

3 Using the simplified model does not reduce the computation com¬ 
plexity of the Gibbs sampling. 


used in the KDD’ 11 Cup lfl2l (which has ~250M music ratings 
and is often referred to by the same name). For the Netflix dataset, 
we use 80% of the ratings for training and the remaining 20 % for 
testing as in CD. For the Yahoo dataset, the memory footprint was 
around 17GB for the train and test ratings, and around 1GB for U 
and V with D = 60 in our 64-bit float based implementation. The 
memory footprint of the Netflix dataset was relatively small. 

We used Julia m to configure the cluster and execute the core 
routines of the algorithms. The core routines were implemented 
in C for high performance. For distributed computing, we used 
Amazon EC2 instances jTj of type “r3" which were equipped with 
Intel Xeon 2.5 GHz CPUs and had memory configurable up to 
244GB. Although the instances had multiple cores, we restricted 
all algorithms, except Gibbs sampling, to run on a single-core. For 
Gibbs sampling, we used a 12-core machine with the same CPU 
speed. All algorithms were implemented as an in-memory execu¬ 
tion model and thus no disk I/O overheads were considered. 

We annealed the step size according to the schedule e t = e 0 (l + 
t/ k) y , (as in 15l |24| ) which satisfies the convergence conditions 
in Eqn. m We found ft, which controls the decay rate, over 
the range ft = [10, 50, 100, 500,1000,1500], The initial step size 
eo was also selected from [9e-6,le-6] for Netflix and [3e-6,8e-7] 
for Yahoo. More detailed settings are given in the Appendix. We 
decreased the stepsize after every round which we set to 50 updates. 
We used 7 = 0.51 in all experiments. 

We set the hyperparameters r = 2.0 and ao = 1.0 for all ex¬ 
periments. We used /3o = 1.0 for all algorithms except SOLD 
and DSGLD. For SGLD and DSGLD, the scale of the prior gra¬ 
dients sometimes became large due to multiplication by the bias 
correctors 1 /hi* and 1 /h*j. In this case, instead of increasing the 
mini-batch size to reduce the scale of the correctors, we used a 
more appropriate scale parameter for the Gamma prior distribution 
0o = 300), to stabilize the scale of precisions sampled from the 
posterior Gamma distribution. 

Mini-batch sizes were set to 50K data points for Netflix and 
100K for Yahoo. The initial values for the precisions A were all 
chosen to be 2.0 after testing over a range [10, 5, 2,1, 0.1, 0.01]. In 
SGLD and DSGLD, the precision parameters were sampled every 
50 rounds after burn-in. We discarded (burned) samples until the 
RMSE reached 0.85 for Netflix and 1.08 for Yahoo. For DSGLD, 
which deploys multiple chains, we used the arithmetic mean of the 
RMSE of all chains to determine whether burn-in has completed. 
We set the thinning interval to 10 rounds, i.e. we use only every 
10 th sample to compute the average prediction. The Gibbs sam¬ 
pler in our experiments was initialized near a MAP state which we 
found using SGD during burn-in. 

Running DSGLD requires a block scheduler (line 4 in Algorithm 
|2j that determines which blocks (workers) are used by each chain in 
a round. In our experiments, the blocks and the orthogonal groups 
were chosen beforehand and were assigned to chains deterministi¬ 
cally using a cyclic-shift (rotation) at every round with equal visit¬ 
ing frequency. This scheduling policy is illustrated in Fig. [2] 

4.3 Results 

4.3.1 Convergence and wall-clock time 

We first compare the RMSE of the algorithms as a function of 
computational time. In this experiment, we set D= 30 for both 
datasets and used 9 workers for Netflix and 16 workers for Yahoo. 
Given S workers, we used a y/S x y/S block-split for DSGLD-S, 
Sxl split for DSGLD-C and S x S split for DSGD. The total 
runtime was set to 50K seconds («14 hours) for Netflix and 100K 
seconds («27 hours) for Yahoo. In both Figs. [3] and [4] the x-axis 









Figure 3: Netflix dataset (D = 30). 




Figure 4: Yahoo Music Rating dataset (D = 30). 


is in log-scale for the figure on the left and in linear-scale for the 
figure on the right. 

In Fig. [3] we show results on the Netflix dataset (which is smaller 
than the Yahoo dataset). We see that in the early (burn-in) stage, 
all algorithms except Gibbs reduce error at a similar rate. Even 
though DSGLD-S and DSGD uses block orthogonality to update 
the sub-parameters of a chain in parallel, because of communica¬ 
tion overheads, the gain in speed-up is not enough to outperform 
a non-distributed algorithm like SGLD which is able to reduce the 
error at a similar rate (because the dataset size is not very large) 
without any communication overhead. Note that because there are 
many chains for DSGLD, we plot the RMSE from only one chain 
during bum-in. The variance of RMSE across the chains was small 
during burn-in. 

When the burn-in phase ends at around 500 - 700 seconds, MCMC 
algorithms (SGLD, DSGLD, and Gibbs) begin to collect samples 
and average their predictions over the samples, while DSGD does 
not and begins to overfit. Interestingly, at this point, we see a re¬ 
markably steep decrease in error for both DSGLD-S and DSGLD- 
C. In particular, we see the largest decrease for DSGLD-C which 
deploys 9 independent chains (whereas DSGLD-S uses 3 chains). 
Note that this is not solely a consequence of collecting a larger 
number of samples from multiple chains. We believe that the av¬ 
eraged prediction using many independent chains provides better 
generalization because many modes are likely to be explored (or, a 
large area of a single broad mode can be covered quickly if many 
chains reside there). After more investigation, we indeed observed 
that the same number of samples collected from a single chain (e.g. 
SGLD) cannot achieve the same level of accuracy obtained with 
multiple randomly initialized chains. Furthermore, we observed 
that given a lot more computational time, SGLD and DSGLD-S 


can approach the RMSE obtained by DSGLD-C as they also get a 
chance to explore other modes or to cover a larger area of a single 
mode. We will revisit the effect of multiple chains in more de¬ 
tail in the next section. Finally, note that Gibbs sampling achieves 
lower RMSE than DGSLD-C after around 20K seconds (5.5 hours) 
as shown in Fig. [3] left (but the difference to DSGLD-C is very 
small). Note that for this dataset, D and L + M were not too 
large and we used 12-core single machine for parallel Gibbs sam¬ 
pling. Therefore the computational cost of each iteration was not 
extremely high. 

We present our results on the Yahoo dataset in Fig. [4]with S = 
16 workers. A remarkable point is that, here, unlike with the Net¬ 
flix dataset, DSGLD-S outperforms DSGLD-C. This is because us¬ 
ing orthogonal blocks increases the number of parameters updated 
per round, resulting in increased convergence speed even after off¬ 
setting the communication overhead. As expected, a similar ef¬ 
fect is observed for DSGD. The progress of parameter updates in 
DSGLD-C is relatively slow, requiring S = 16 rounds to update 
all the parameters. Besides, DSGLD-C has a much larger commu¬ 
nication overhead because the full matrix V has to be transferred 
between the parameter server and each of the workers, whereas 
only a small block of V is transferred in DSGLD-S. Specifically, 
in DSGLD-C the parameter server sends and receives packets of to¬ 
tal size 0((L + SM)D) per round whereas in DSGLD-S the total 
packet size is only 0((L + M)D). Although DSGLD-C is rather 
slow during burn-in, after bum-in we still see a faster decrease in 
RMSE compared to SGLD because multiple chains can mix better. 
Gibbs sampling converges slower than it does on the Netflix dataset 
because for the Yahoo dataset the number of latent vectors to up¬ 
date, i.e. L + M, increases by a factor of four, and the number of 
ratings, N, by a factor of seven. 





















































(a) DSGLD-C on the Netflix dataset 



Figure 5: The effect of the number of chains, number of workers, and block split. 


For the Netflix dataset, after IK seconds, DSGLD-C achieved 
the RMSE (0.8145) that the Gibbs sampler obtains at 10K sec¬ 
onds. Similarly, after 1 IK seconds. DSGLD-S achieved the RMSE 
(1.0454) that the Gibbs sampler obtains at 100K seconds. There¬ 
fore, the proposed method converges an order of magnitude faster 
than Gibbs sampling on both datasets, which is especially impor¬ 
tant when we only have a limited computational budget. 

DSGD converges to a prediction RMSE of 0.8462 on Netflix and 
1.0576 on Yahoo after IK seconds and 10K seconds respectively. 
Given the same amount of computational time, DSGLD achieves 
an error of 0.8161 on Netflix and 1.0465 on Yahoo, a relative im¬ 
provement of 3.7% and 1.1%. After convergence, the RI increases 
to 4.1% for Netflix and 1.8% for Yahoo (See Table. [3j. 

4.3.2 Number of chains and workers 

We also investigated the effect of the number of chains and the 
number of workers. The results are presented in Fig. [5] Ac¬ 
cording to the observations from the previous experiment, we used 
DSGLD-C for Netflix and DSGD-S for Yahoo to study this effect. 
The latent feature dimension was set to D= 30. 

In Fig. [5] (a), we compare DSGLD-C with [1,3, 6, 9] chains 
(and workers) and in each case we evenly split the rows of the rat¬ 
ing matrix between the chains. Note that DSGLD-C (lxl) is the 
same as SOLD running on a single-machine. We see that during 
burn-in DSGLD-C (lxl) converges faster than the other splits be¬ 
cause there is no communication overhead. After burn-in, when the 
chains start averaging predictions, we see a sharp decrease in error 
for the other splits. Although splits with more chains decrease error 
much faster, they all eventually converge to a similar value. Due to 
poor mixing, a single chain (i.e. SOLD) converges very slowly. 

In Fig. [5] (b), we show results for DSGLD-S on the Yahoo 
dataset. We increased the number of workers to [1, 4,16, 36] to 
compare [1, 2,4,6] parallel chains. Again DSGLD-S (lxl) denotes 
SGLD running on a single machine. We see that SGLD converges 
much more slowly because the dataset is larger than Netflix and 
SGLD has to update more parameters sequentially. Using more or¬ 
thogonal blocks, DSGLD-S can update more parameters in parallel 
and we see more speed-up as we increase the number of workers. 
Although we increase the number of workers quadratically between 
the experiments, the packet size transferred between the parameter 
server and the workers stays constant at 0((L + M)D) because the 
block size also reduces accordingly. Even after burn-in (horizontal 
dotted black line at 1.08 RMSE) we see that with more chains we 
can decrease the error faster. This is because (i) multiple chains 
help to mix better by exploring a broader space (ii) each chain can 


mix faster by updating orthogonal blocks in parallel. 

4.3.3 Latent feature dimension 

In Fig. [6] we show how the latent feature dimension affects the 
final RMSE. The final RMSE on Netflix is measured after 50K sec¬ 
onds (14 hours) of computational time, because by then all algo¬ 
rithms had converged (except Gibbs sampling which is expected to 
take much longer). On the Yahoo dataset, we increased the compu¬ 
tational time to 100K secs (1 day), 200K secs (2.3 days), and 300K 
secs (3.5 days) for D=[30,60,100], respectively, to give the Gibbs 
sampler more time to converge. In table [3] we show the RMSEs 
of the different algorithms and the relative improvement (or dete¬ 
rioration) compared to DSGLD. The Relative Improvement (RI) of 
an algorithm x is defined as RI(x) = (rd — r x )/rd, where r x is 
the RMSE achieved by algorithm x and is the RMSE obtained 
using DSGLD. 

In both Fig. [6] (a) and (b), we see a large difference in per¬ 
formance between SG-MCMC (SGLD and DSGLD) and the op¬ 
timization methods (SGD and DSGD). The RI is 3.6% — 4.6% on 
Netflix and 1.8%—3.9% on the Yahoo dataset. As observed in l28l . 
we see that the optimization methods do not consistently improve 
with increasing D. One reason is that optimization methods are 
highly sensitive to the hyperparameter values which become dif¬ 
ficult to tune as the model becomes more complex. However, our 
method consistently improves as we increase D, because the hyper¬ 
parameters are sampled from their posterior distributions. We also 
see that the performance of Gibbs sampling on Netflix gets worse 
as D increases, because we used the same amount of computational 
budget for all D although the computation complexity increases as 
D does. On the Yahoo dataset on which we increase computational 
time as we increase D, we see that the RMSE for Gibbs increases 
as D increases, but is still lower than that of DSGLD. 

In Fig. |6](c), we compare the time (in seconds) required to draw 
a single sample for the three sampling algorithms at different values 
of D on the Yahoo dataset. We see that Gibbs sampling is almost 
two orders of magnitude slower than SGLD. For D=100, SGLD, 
DSGLD-S, and Gibbs generated 688, 460, and 8 samples respec¬ 
tively in 300K seconds of computational time. For Netflix, Gibbs 
generated around 100 samples in 50K seconds for D= 30. Thus, 
even though the Gibbs sampler can produce higher quality samples 
(in terms of lower auto-correlation), the sampling speed is so slow 
that it cannot satisfactorily handle large scale datasets. 

5. CONCLUSION 





















(a) RMSE on Netflix 




(c) Required time per sample 


Figure 6: The effect of the latent feature dimension, (a) and (b) show RMSE for D = [30, 60,100] on (a) the Neflix dataset and (b) the 
Yahoo music ratings dataset. The maximum computational time was set to 50K seconds for Netflix and 100K (D=30), 200K (0=60), 
and 300K (D=100) seconds for Yahoo, (c) shows time (in seconds) required to draw a single sample on the Yahoo dataset. 


D 

SGD 

DSGD 

SGLD 

DSGLD-C 

Gibbs 

D 

SGD 

DSGD 

SGLD 

DSGLD-S 

Gibbs 

30 

0.8421 

-3.63% 

0.8462 

-4.13% 

0.8143 

-0.21% 

0.8126 

0.8118 

+0.09% 

30 

1.0578 

-1.83% 

1.0576 

-1.82% 

1.0448 
-0.58 % 

1.0387 

1.0454 

-0.64% 

60 

0.8447 

-4.62% 

0.8428 

-4.38% 

0.8097 

-0.28% 

0.8074 

0.8259 

-2.29% 

60 

1.0548 

-2.73% 

1.0588 

-3.13% 

1.0351 

-0.82% 

1.0267 

1.0364 

-0.94% 

100 

0.8415 

-4.63% 

0.8395 

-4.37% 

0.8082 

-0.48% 

0.8043 

0.8339 

-3.68% 

100 

1.0567 

-3.30% 

1.0631 

-3.93% 

1.0335 

-1.04% 

1.0229 

1.0339 

-1.08% 


Table 3: RMSE and relative improvement (RI). Left: Netflix. Right: Yahoo. The percentage shown below each RMSE value is the 
relative improvement. 


Most applications of matrix factorization to recommender sys¬ 
tems are based on stochastic gradient optimization algorithms be¬ 
cause these are the only ones that can computationally handle very 
large datasets. However, by restricting ourselves to such simple al¬ 
gorithms, we miss out on all the advantages of Bayesian modelling 
such as quantifying uncertainty, controlling over-fitting, incorporat¬ 
ing prior information and better prediction accuracy. In this paper, 
we introduced a novel algorithm for scalable distributed Bayesian 
matrix factorization that achieves the best of both worlds, i.e. it 
inherits all the advantages of Bayesian inference at the speed of 
stochastic gradient optimization. 

Our algorithm, based on Distributed Stochastic Gradient Langevin 
Dynamics, uses only a mini-batch of ratings to make each update 
as in Stochastic Gradient Descent optimization. By running mul¬ 
tiple chains in parallel, and also using multiple workers within a 
chain to update orthogonal blocks, we can scale up Bayesian Ma¬ 
trix Factorization to very large datasets. Parallel chains with differ¬ 
ent random initializations also help us to average predictions from 
multiple modes and improve accuracy. Moreover, our algorithm 
can effectively handle datasets that are distributed across multiple 
machines unlike traditional MCMC algorithms. 

We believe that our method is just one example of a much larger 
class of scalable distributed Bayesian matrix factorization methods. 
For example, we can consider using more sophisticated stochastic 
gradient algorithms a mi nni hd in place of SGLD to further 
improve the mixing rate. 
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APPENDIX 

A. STEP-SIZE PARAMETERS 



SGD 

DSGD 

SGFD 

DSGFD-C 

DSGFD-S 

eo 

9e-6 

le-6 

9e-6 

9e-6 

3e-6 

K, 

50 

10 

1000 

1000 

500 

Table 4: Stepsize parameters for Netflix D= 30 and 9 workers 



SGD 

DSGD 

SGLD 

DSGFD-C 

DSGFD-S 

eo 

1.5e-6 

3e-7 

1.5e-6 

9e-7 

1.5e-6 

Av 

500 

100 

1000 

1000 

500 


Table 5: Stepsize parameters for Yahoo D= 30 and 16 workers 



